- >I could easily write a robot which would roam around the Web (perhaps
- >stochastically?), and verify the html, using sgmls. Then, whenever
- >I come across something that's non-compliant, I could automatically
- >send mail to wwwmaster@sitename. No one would have to annoy anyone else
- >about whether or not they've verified their HTML; a program would annoy
- >them automatically.
-
- I have written a robot that does this, except it doesn't check for
- valid SGML -- it just tries to map out the entire web. I believe I
- found roughly 50 or 60 different sites (this was maybe 2 months ago --
- I'm sorry, I didn't save the output). It took the robot about half a
- day (a Saturday morning) to complete.
-
- There were several problems.
-
- First, some sites were down and my robot would spend a considerable
- time waiting for the connection to time out each time it found a link
- to such a site. I ended up remembering the last error from a site and
- skipping sites that were obviously down, but there are many different
- errors you can get, depending on whether the host is down,
- unreachable, doesn't run a WWW server, doesn't recognize the document
- address you want, or has some other trouble (some sites were going up
- and down while my robot was running, causing additional confusion).
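- The skip-dead-sites idea can be sketched roughly like this in
- modern Python (the names and the caching policy here are my own
- illustration, not the robot's actual code):

```python
import urllib.parse
import urllib.request

# hostname -> reason for the most recent failure (hypothetical cache)
dead_hosts = {}

def fetch(url, timeout=10):
    """Fetch a URL, skipping hosts that have already failed once."""
    host = urllib.parse.urlsplit(url).hostname
    if host in dead_hosts:
        return None  # this site looked down last time; don't wait again
    try:
        with urllib.request.urlopen(url, timeout=timeout) as f:
            return f.read()
    except OSError as err:  # covers timeouts, refused connections, DNS errors
        dead_hosts[host] = str(err)  # remember why this host failed
        return None
```

- The real robot had to distinguish many more error cases (host down,
- unreachable, no server, bad document address); the single cache above
- just shows the "remember and skip" shape of the fix.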
-
- Next, more importantly, some sites have an infinite number of
- documents. There are several causes for this.
-
- First, several sites have gateways to the entire VMS documentation (I
- have never used VMS but apparently the VMS help system is a kind of
- hypertext). While not exactly infinite the number of nodes is *very*
- large. Luckily such gateways are easily recognized by the kind of
- pathname they use, and VMS help is unlikely to contain pointers to
- anything except more VMS help, so I put in a simple trap to stop
- these.
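- A trap like that can be as simple as a path-prefix check; the
- prefixes below are invented for illustration (the real gateways had
- their own characteristic paths):

```python
# Made-up example prefixes; the actual VMS help gateways were
# recognizable by their own characteristic path names.
VMS_GATEWAY_PREFIXES = ("/htbin/vmshelp", "/cgi-bin/vmshelpgate")

def is_vms_help(path):
    """Return True for paths that look like a VMS help gateway."""
    return path.startswith(VMS_GATEWAY_PREFIXES)
```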
-
- Next, there are other gateways. I can't remember whether I
- encountered a Gopher or WAIS gateway, but these would have even worse
- problems.
-
- Finally, some servers contain bugs that cause loops, by referring to
- the same document with an ever-growing path. (The rules for
- resolving relative paths are tricky, and since I was using my own
- www client, which isn't derived from Tim's, the problem was more
- severe for me; but I have also found occurrences reproducible with
- the CERN www client.)
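- The failure mode is that a page at /a/b.html containing the relative
- reference "b.html" must resolve to /a/b.html again, not /a/a/b.html.
- Today urllib.parse.urljoin implements the correct resolution rules;
- a crude guard against the ever-growing-path bug is to cap the path
- depth (the limit of 20 below is an arbitrary assumption of mine):

```python
from urllib.parse import urljoin, urlsplit

def resolve(base, ref, max_depth=20):
    """Resolve a relative reference and guard against runaway paths."""
    url = urljoin(base, ref)  # applies the relative-resolution rules
    if urlsplit(url).path.count("/") > max_depth:
        raise ValueError("suspiciously deep path: " + url)
    return url
```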
-
- Although I didn't specifically test for bad HTML, I did have to parse
- the HTML to find the links, and found occasional errors. I believe
- there are a few binaries and PostScript and WP files with links
- pointing to them, which take forever to fetch. There were also
- broken addresses scattered here and there -- this was a good
- occasion for me to debug my www client library.
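- The link-extraction step can be sketched with today's html.parser,
- which tolerates the kind of sloppy markup described above (this is
- my own illustration, not the robot's parser):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes of <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```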
-
- If people are interested, I could run the robot again and report a
- summary of the results.
-
- I also ran a gopher robot, but after 1600 sites I gave up... The
- Veronica project in the Gopher world does the same and makes the
- results available as a database, although the last time I tried it the
- Veronica server seemed too overloaded to respond to a simple query.
-
- If you want source for the robots, they're part of the Python source
- distribution: ftp to ftp.cwi.nl, directory pub/python, file
- python0.9.8.tar.Z. The robot (and in fact my entire www and gopher
- client library) is in the tar archive in directory python/demo/www.
- The texinfo to html conversion program that I once advertised here is
- also there. (I'm sorry, you'll have to build the python interpreter
- from the source before any of these programs can be used...)
- Note that my www library isn't up to date with the latest HTML specs;
- this is a hobby project and I needed my time for other things...
-
- --Guido van Rossum, CWI, Amsterdam <Guido.van.Rossum@cwi.nl>